Making predictions over the Amazon Fine Food Reviews dataset

Predictions

The purpose of this analysis is to build a prediction model that can tell whether a recommendation is positive or negative. In this analysis, we will not focus on the Score itself, but only on the positive/negative sentiment of the recommendation.

To do so, we will work on Amazon's recommendation dataset and build a term-document matrix with tf-idf (term frequency, inverse document frequency) weighting. Once the data is ready, we will feed it into predictive algorithms, mainly naïve Bayes and logistic regression.
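
As a quick reminder, the tf-idf weight of a term t in a document d, over a corpus of N documents, is

    tfidf(t, d) = tf(t, d) * idf(t),    with    idf(t) = log(N / df(t))

where tf(t, d) counts the occurrences of t in d and df(t) is the number of documents containing t. (scikit-learn's TfidfTransformer uses a smoothed variant of this idf by default.)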

In the end, we hope to find a "best" model for predicting the recommendation's sentiment.

Loading the data

In order to load the data, we will use the SQLite database, from which we will only fetch the Score and the recommendation summary.

As we only want to get the global sentiment of the recommendations (positive or negative), we will purposefully ignore all Scores equal to 3. If the score is above 3, the recommendation will be set to "positive"; otherwise, it will be set to "negative".

The data will be split into a training set and a test set, with a test-set ratio of 0.2.


In [2]:
%matplotlib inline

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import matplotlib as mpl

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

Let's first check whether we have the dataset available:


In [3]:
import os
from IPython.core.display import display, HTML
    
if not os.path.isfile('database.sqlite'):
    display(HTML("<h3 style='color: red'>Dataset database missing!</h3><h3> Please download it "+
          "<a href='https://www.kaggle.com/snap/amazon-fine-food-reviews'>from here on Kaggle</a> "+
          "and extract it to the current directory."))
    raise(Exception("missing dataset"))



In [ ]:
con = sqlite3.connect('database.sqlite')

pd.read_sql_query("SELECT * FROM Reviews LIMIT 3", con)

Let's select only what's of interest to us:


In [ ]:
messages = pd.read_sql_query("""
SELECT 
  Score, 
  Summary, 
  HelpfulnessNumerator as VotesHelpful, 
  HelpfulnessDenominator as VotesTotal
FROM Reviews 
WHERE Score != 3""", con)

Let's see what we've got:


In [ ]:
messages.head(5)

Let's add a Sentiment column that turns the numeric score into either positive or negative.

Similarly, a Usefulness column marks a review as "useful" when more than 80% of its votes are helpful, and "useless" otherwise.


In [ ]:
messages["Sentiment"] = messages["Score"].apply(lambda score: "positive" if score > 3 else "negative")
messages["Usefulness"] = (messages["VotesHelpful"]/messages["VotesTotal"]).apply(lambda n: "useful" if n > 0.8 else "useless")

messages.head(5)

Let's have a look at some 5s:


In [ ]:
messages[messages.Score == 5].head(10)

And some 1s as well:


In [ ]:
messages[messages.Score == 1].head(10)

Extracting features from text data

scikit-learn cannot work directly with words, so we'll assign a new dimension to each word and work with word counts.

See more here: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
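
As a minimal illustration (on a made-up two-sentence corpus), here is how CountVectorizer assigns a column to each word and counts its occurrences:

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, just to show the word -> column mapping
toy_corpus = ["the soup was great", "the soup was awful"]
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_corpus)  # sparse matrix: 2 documents x vocabulary size
print(toy_vect.vocabulary_)                      # word -> column index
print(toy_counts.toarray())                      # per-document word counts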


In [9]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

import re
import string
import nltk

cleanup_re = re.compile('[^a-z]+')
def cleanup(sentence):
    sentence = sentence.lower()
    sentence = cleanup_re.sub(' ', sentence).strip()
    #sentence = " ".join(nltk.word_tokenize(sentence))
    return sentence

messages["Summary_Clean"] = messages["Summary"].apply(cleanup)

train, test = train_test_split(messages, test_size=0.2)
print("%d items in training data, %d in test data" % (len(train), len(test)))


420651 items in training data, 105163 in test data

In [10]:
from wordcloud import WordCloud, STOPWORDS

# To remove stop words, pass stop_words = STOPWORDS to CountVectorizer,
# but the results seem to be better without it
count_vect = CountVectorizer(min_df = 1, ngram_range = (1, 4))
X_train_counts = count_vect.fit_transform(train["Summary_Clean"])

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

X_new_counts = count_vect.transform(test["Summary_Clean"])
X_test_tfidf = tfidf_transformer.transform(X_new_counts)

y_train = train["Sentiment"]
y_test = test["Sentiment"]

prediction = dict()

Let's get fancy with WordClouds!


In [11]:
from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS)

#mpl.rcParams['figure.figsize']=(8.0,6.0)    #(6.0,4.0)
mpl.rcParams['font.size']=12                #10 
mpl.rcParams['savefig.dpi']=100             #72 
mpl.rcParams['figure.subplot.bottom']=.1 


def show_wordcloud(data, title = None):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=200,
        max_font_size=40, 
        scale=3,
        random_state=1 # chosen at random by flipping a coin; it was heads
    ).generate(str(data))
    
    fig = plt.figure(1, figsize=(8, 8))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()
    
show_wordcloud(messages["Summary_Clean"])


We can also view wordclouds for only positive or only negative entries:


In [12]:
show_wordcloud(messages[messages.Score == 1]["Summary_Clean"], title = "Low scoring")



In [13]:
show_wordcloud(messages[messages.Score == 5]["Summary_Clean"], title = "High scoring")


Create a Multinomial Naïve Bayes model


In [14]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB().fit(X_train_tfidf, y_train)
prediction['Multinomial'] = model.predict(X_test_tfidf)

Create a Bernoulli Naïve Bayes model


In [15]:
from sklearn.naive_bayes import BernoulliNB
model = BernoulliNB().fit(X_train_tfidf, y_train)
prediction['Bernoulli'] = model.predict(X_test_tfidf)

Create a Logistic Regression model


In [16]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e5)
logreg_result = logreg.fit(X_train_tfidf, y_train)
prediction['Logistic'] = logreg.predict(X_test_tfidf)

Create a Linear SVC model


In [17]:
from sklearn.svm import LinearSVC
linsvc = LinearSVC(C=1e5)
linsvc_result = linsvc.fit(X_train_tfidf, y_train)
prediction['LinearSVC'] = linsvc.predict(X_test_tfidf)

Analyzing Results

Before analyzing the results, let's recall what precision and recall are (more details: https://en.wikipedia.org/wiki/Precision_and_recall).
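
In terms of true positives (TP), false positives (FP) and false negatives (FN):

    precision = TP / (TP + FP)        recall = TP / (TP + FN)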

ROC Curves

In order to compare our learning algorithms, let's build the ROC curve. The curve with the highest AUC value will show our "best" algorithm.

In a first data-cleaning pass, stop-word removal was used, but the results were noticeably worse. A likely reason is that when people express whether something is or is not good, they use many small words such as "not", which are typically tagged as stop-words and removed. This is why, in the end, it was decided to keep the stop-words. For those who would like to try it themselves, the stop-word removal is left as a comment in the cleaning part of the analysis.


In [18]:
def formatt(x):
    if x == 'negative':
        return 0
    return 1
vfunc = np.vectorize(formatt)

cmp = 0
colors = ['b', 'g', 'y', 'm', 'k']
for model, predicted in prediction.items():
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test.map(formatt), vfunc(predicted))
    roc_auc = auc(false_positive_rate, true_positive_rate)
    plt.plot(false_positive_rate, true_positive_rate, colors[cmp], label='%s: AUC %0.2f'% (model,roc_auc))
    cmp += 1

plt.title('Classifiers comparison with ROC')
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()


After plotting the ROC curves, it would appear that the logistic regression model gives the best results, although its AUC value is not outstanding.
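
Note that the curves above are computed from hard class predictions, so each classifier really contributes a single operating point. As a minimal sketch (reusing the logreg model fitted earlier), a probability-based ROC curve for the logistic model could be drawn like this:

In [ ]:
# Sketch: ROC from predicted probabilities instead of hard labels
probs = logreg.predict_proba(X_test_tfidf)[:, 1]   # P("positive"); classes_ are sorted alphabetically
fpr, tpr, _ = roc_curve(y_test.map(formatt), probs)
plt.plot(fpr, tpr, label='Logistic (probabilities): AUC %0.2f' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()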

It looks like the best models are LogisticRegression and LinearSVC. Let's see the precision, recall and confusion matrix for these models:


In [19]:
for model_name in ["Logistic", "LinearSVC"]:
    print("Classification report for %s" % model_name)
    # y_test already holds the string labels "negative"/"positive"
    print(metrics.classification_report(y_test, prediction[model_name]))
    print()


Classification report for Logistic
             precision    recall  f1-score   support

   negative       0.88      0.84      0.86     16436
   positive       0.97      0.98      0.97     88727

avg / total       0.96      0.96      0.96    105163


Classification report for LinearSVC
             precision    recall  f1-score   support

   negative       0.83      0.85      0.84     16436
   positive       0.97      0.97      0.97     88727

avg / total       0.95      0.95      0.95    105163



In [20]:
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues, labels=["positive", "negative"]):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(labels))
    plt.xticks(tick_marks, labels, rotation=45)
    plt.yticks(tick_marks, labels)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    
# Compute confusion matrix
cm = confusion_matrix(y_test, prediction['Logistic'])
np.set_printoptions(precision=2)
plt.figure()
plot_confusion_matrix(cm)    

cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plt.figure()
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')
plt.show()


Let's also have a look at what the best and worst words are by looking at the model coefficients:


In [21]:
words = count_vect.get_feature_names()
feature_coefs = pd.DataFrame(
    data = list(zip(words, logreg_result.coef_[0])),
    columns = ['feature', 'coef'])

feature_coefs.sort_values(by='coef')


Out[21]:
feature coef
967049 worst -44.423977
983288 yuck -32.798962
820601 terrible -30.725437
422085 horrible -29.568083
61664 awful -29.549015
587861 not -29.517080
227249 disgusting -25.736642
569844 nasty -25.434613
966932 worse -23.675443
771474 stale -23.175920
55684 at best -23.045667
743490 sick -22.671042
937834 weak -22.439037
64217 bad -22.389412
393691 gross -22.329017
668716 poor -22.136753
518607 low quality -21.973189
443233 instead -21.778065
222437 didn -21.766492
804948 tasteless -21.114726
58868 avoid -20.743722
697967 rancid -20.091260
598406 not very good -20.060497
348817 good and tangy -18.965442
597882 not too good -18.902264
582723 no flavor -18.842567
983599 yuk -18.803222
669276 poorly -17.863233
546986 moldy -17.835281
440440 inedible -17.749561
... ... ...
272176 fabulous 17.914353
589293 not burnt 18.011484
97636 better than 18.186750
590191 not disappointed 18.459886
516173 loves 18.656262
593258 not like cardboard 19.036431
751726 smooth 19.144529
212288 delicious 19.151504
592145 not greasy 19.260567
285714 finally 19.447863
591578 not from china 19.477991
4758 addictive 19.521455
590828 not expired 19.645489
593574 not made in china 19.661448
278989 favorite 20.092982
412574 heaven 20.168960
348222 good 20.416811
597821 not too 20.731870
17668 amazing 20.782510
508455 love 22.401286
274917 fantastic 24.135847
59431 awesome 24.517802
598472 not very salty 24.682914
596011 not so bad 24.806706
655724 perfect 25.407171
961404 wonderful 26.585667
589143 not bitter 26.717856
588671 not bad 35.914588
368355 great 37.114205
81362 best 42.539952

988938 rows × 2 columns


In [22]:
def test_sample(model, sample):
    sample_counts = count_vect.transform([sample])
    sample_tfidf = tfidf_transformer.transform(sample_counts)
    result = model.predict(sample_tfidf)[0]
    prob = model.predict_proba(sample_tfidf)[0]
    print("Sample estimated as %s: negative prob %f, positive prob %f" % (result.upper(), prob[0], prob[1]))

test_sample(logreg, "The food was delicious, it smelled great and the taste was awesome")
test_sample(logreg, "The whole experience was horrible. The smell was so bad that it literally made me sick.")
test_sample(logreg, "The food was ok, I guess. The smell wasn't very good, but the taste was ok.")


Sample estimated as POSITIVE: negative prob 0.000921, positive prob 0.999079
Sample estimated as NEGATIVE: negative prob 0.999997, positive prob 0.000003
Sample estimated as POSITIVE: negative prob 0.245712, positive prob 0.754288

Now let's try to predict how helpful a review is


In [23]:
show_wordcloud(messages[messages.Usefulness == "useful"]["Summary_Clean"], title = "Useful")
show_wordcloud(messages[messages.Usefulness == "useless"]["Summary_Clean"], title = "Useless")


Nothing seems to pop out... Let's try limiting the dataset to entries with at least 10 votes.


In [24]:
messages_ufn = messages[messages.VotesTotal >= 10]
messages_ufn.head()


Out[24]:
Score Summary VotesHelpful VotesTotal Sentiment Usefulness Summary_Clean
32 4 Best of the Instant Oatmeals 19 19 positive useful best of the instant oatmeals
33 4 Good Instant 13 13 positive useful good instant
75 5 Forget Molecular Gastronomy - this stuff rocke... 15 15 positive useful forget molecular gastronomy this stuff rockes ...
145 5 tastes very fresh 17 19 positive useful tastes very fresh
195 1 CHANGED FORMULA MAKES CATS SICK!!!! 3 10 negative useless changed formula makes cats sick

Now let's try again with the word clouds:


In [25]:
show_wordcloud(messages_ufn[messages_ufn.Usefulness == "useful"]["Summary_Clean"], title = "Useful")
show_wordcloud(messages_ufn[messages_ufn.Usefulness == "useless"]["Summary_Clean"], title = "Useless")


This seems a bit better. Let's see if we can build a model for usefulness, though:


In [26]:
from sklearn.pipeline import Pipeline

train_ufn, test_ufn = train_test_split(messages_ufn, test_size=0.2)

ufn_pipe = Pipeline([
    ('vect', CountVectorizer(min_df = 1, ngram_range = (1, 4))),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(C=1e5)),
])

ufn_result = ufn_pipe.fit(train_ufn["Summary_Clean"], train_ufn["Usefulness"])

prediction['Logistic_Usefulness'] = ufn_pipe.predict(test_ufn["Summary_Clean"])
print(metrics.classification_report(test_ufn["Usefulness"], prediction['Logistic_Usefulness']))


             precision    recall  f1-score   support

     useful       0.84      0.88      0.86      2998
    useless       0.76      0.69      0.72      1615

avg / total       0.81      0.81      0.81      4613

Let's also see which of the reviews are rated by our model as most helpful and least helpful:


In [27]:
# a[0] is the predicted probability of "useful" (classes_ are sorted alphabetically);
# the pipeline was fitted on Summary_Clean, but raw Summary works too since CountVectorizer lowercases and tokenizes
ufn_scores = [a[0] for a in ufn_pipe.predict_proba(train_ufn["Summary"])]
ufn_scores = zip(ufn_scores, train_ufn["Summary"], train_ufn["VotesHelpful"], train_ufn["VotesTotal"])
ufn_scores = sorted(ufn_scores, key=lambda t: t[0], reverse=True)

# just make this into a DataFrame since jupyter renders it nicely:
pd.DataFrame(ufn_scores)


Out[27]:
0 1 2 3
0 1.000000e+00 best 13 13
1 9.999999e-01 Great for Baking 10 11
2 9.999999e-01 Great for baking! 10 10
3 9.999999e-01 Great for Baking 23 23
4 9.999999e-01 Great for baking 9 10
5 9.999999e-01 Great for baking 32 34
6 9.999999e-01 Great for baking 12 13
7 9.999999e-01 Great for Baking- 14 15
8 9.999999e-01 Great for baking 12 13
9 9.999998e-01 FINALLY 97 99
10 9.999998e-01 Finally! 11 12
11 9.999998e-01 FINALLY 69 74
12 9.999998e-01 Finally! 10 11
13 9.999998e-01 Finally! 20 23
14 9.999995e-01 best bread 24 24
15 9.999994e-01 Best Honey 19 22
16 9.999994e-01 Best honey 28 30
17 9.999988e-01 Best Value 15 15
18 9.999986e-01 best of the best 11 11
19 9.999979e-01 Best popcorn ever 13 14
20 9.999979e-01 Best popcorn ever! 18 18
21 9.999975e-01 I love these 21 21
22 9.999975e-01 Love these! 12 12
23 9.999975e-01 Love these!!! 13 13
24 9.999975e-01 Love these! 14 15
25 9.999972e-01 Best Gluten Free Bread Mix 33 34
26 9.999971e-01 My Dogs LOVE these 14 14
27 9.999971e-01 My Dogs LOVE these 14 14
28 9.999970e-01 A quality product at a good price, but not wha... 33 33
29 9.999967e-01 GREAT ALTERNATIVE! 10 10
... ... ... ... ...
18422 8.361379e-07 Yuck! 6 10
18423 8.361379e-07 Yuck! 4 13
18424 8.361379e-07 Yuck! 5 15
18425 8.361379e-07 YUCK !!! 8 11
18426 8.361379e-07 Yuck! 6 17
18427 8.361379e-07 yuck!! 13 17
18428 2.465316e-07 gross 5 18
18429 2.465316e-07 Gross! 7 13
18430 2.465316e-07 Gross! 7 13
18431 2.465316e-07 Gross! 7 13
18432 2.465316e-07 Gross! 7 13
18433 2.465316e-07 Gross!!!!! 4 10
18434 2.465316e-07 Gross! 7 13
18435 2.465316e-07 Gross! 4 11
18436 2.465316e-07 GROSS 12 17
18437 2.465316e-07 gross 5 18
18438 2.465316e-07 Gross! 7 13
18439 2.465316e-07 Gross! 7 13
18440 2.465316e-07 gross 5 18
18441 2.465316e-07 Gross! 8 16
18442 2.465316e-07 Gross 9 12
18443 2.465316e-07 gross 5 18
18444 2.465316e-07 Gross! 7 13
18445 2.465316e-07 Gross! 7 13
18446 2.465316e-07 gross 5 12
18447 2.465316e-07 GROSS 9 22
18448 6.339364e-08 The worst!! 6 16
18449 6.339364e-08 The worst !!! 1 16
18450 6.339364e-08 The Worst 1 11
18451 6.339364e-08 The worst!! 6 16

18452 rows × 4 columns


In [28]:
cm = confusion_matrix(test_ufn["Usefulness"], prediction['Logistic_Usefulness'])
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
np.set_printoptions(precision=2)
plt.figure()
plot_confusion_matrix(cm_normalized, labels=["useful", "useless"])


Even more complicated pipeline

This pipeline combines the tf-idf features of the summary text with the raw Score column through a FeatureUnion, giving the Score the larger weight.


In [29]:
from sklearn.pipeline import FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin

# Useful to select only certain features in a dataset for forwarding through a pipeline
# See: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key
    def fit(self, x, y=None):
        return self
    def transform(self, data_dict):
        return data_dict[self.key]

train_ufn2, test_ufn2 = train_test_split(messages_ufn, test_size=0.2)

ufn_pipe2 = Pipeline([
   ('union', FeatureUnion(
       transformer_list = [
           ('summary', Pipeline([
               ('textsel', ItemSelector(key='Summary_Clean')),
               ('vect', CountVectorizer(min_df = 1, ngram_range = (1, 4))),
               ('tfidf', TfidfTransformer())])),
          ('score', ItemSelector(key=['Score']))
       ],
       transformer_weights = {
           'summary': 0.2,
           'score': 0.8
       }
   )),
   ('model', LogisticRegression(C=1e5))
])

ufn_result2 = ufn_pipe2.fit(train_ufn2, train_ufn2["Usefulness"])
prediction['Logistic_Usefulness2'] = ufn_pipe2.predict(test_ufn2)
print(metrics.classification_report(test_ufn2["Usefulness"], prediction['Logistic_Usefulness2']))


             precision    recall  f1-score   support

     useful       0.87      0.89      0.88      2968
    useless       0.79      0.75      0.77      1645

avg / total       0.84      0.84      0.84      4613


In [30]:
cm = confusion_matrix(test_ufn2["Usefulness"], prediction['Logistic_Usefulness2'])
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
np.set_printoptions(precision=2)
plt.figure()
plot_confusion_matrix(cm_normalized, labels=["useful", "useless"])



In [31]:
len(ufn_result2.named_steps['model'].coef_[0])


Out[31]:
86527

Again, let's have a look at the best/worst words:


In [32]:
ufn_summary_pipe = next(tr[1] for tr in ufn_result2.named_steps["union"].transformer_list if tr[0]=='summary')
ufn_words = ufn_summary_pipe.named_steps['vect'].get_feature_names()
ufn_features = ufn_words + ["Score"]
ufn_feature_coefs = pd.DataFrame(
    data = list(zip(ufn_features, ufn_result2.named_steps['model'].coef_[0])),
    columns = ['feature', 'coef'])
ufn_feature_coefs.sort_values(by='coef')


Out[32]:
feature coef
9533 blk water -74.777643
50395 not buy this product -72.157559
36243 hot chocolate -68.105937
32185 great for -66.859561
30127 god awful -65.407671
58458 pretty good -57.912338
50659 not fresh -57.749880
2256 and -57.547117
77706 total ripoff -56.597793
61332 really really bad -56.317598
48615 nasty stuff -55.698508
9140 big surprise -55.177211
39658 issues -55.006595
46885 mold -54.870105
66780 splenda alert -54.074558
20784 don like -53.878187
7488 best -53.213790
58483 pretty nice -53.145168
6016 bad product -52.998426
61528 recipe -52.991121
5900 bad beef -52.526385
204 absolutely overpriced -51.797967
84838 wrong item shipped -51.754976
13940 changed -51.517844
20329 dog danger -50.036589
34129 hard plastic -49.770210
25579 flavoring -49.759078
53544 ok not -49.466632
53545 ok not quite -49.466632
53546 ok not quite what -49.466632
... ... ...
48771 nature candy 52.291246
44255 love this brand 52.329345
34990 healthy family 52.529122
48372 my review 52.734235
43504 little lacking 52.868384
77327 tomato food 53.129313
4573 artificial flavor 53.785953
65840 so versatile 54.168931
50750 not good product 56.192680
53337 oh yes 56.421614
80501 very tasty snack 56.615730
77618 top cat 56.941416
17500 crunchy and delicious 58.100071
44075 love blk water 58.219841
44074 love blk 58.219841
31237 good treats 59.110648
34898 healthy and delicious 59.671328
18479 definitely happy 60.045587
78763 unbelievable product 63.489960
43572 little weak 64.329545
18938 delish but 64.895154
78237 truffle oil 65.825549
32361 great gift basket 67.056958
23050 excellent formula 67.515298
7033 beautiful plant 69.531110
32903 great quality product 69.783854
80139 very convenient 70.858109
28935 fun nostalgia 71.358455
49277 new review 72.898088
1719 amazingly good coffee 73.608544

86527 rows × 2 columns


In [33]:
print("And the coefficient of the Score variable: ")
ufn_feature_coefs[ufn_feature_coefs.feature == 'Score']


And the coefficient of the Score variable: 
Out[33]:
feature coef
86526 Score -1.807011
